Introduction¶
In this notebook, we will create a cumulative report from datasets covering different economic indicators. The datasets have already been cleaned and preprocessed. We will import them and merge them on a common key (quarters) into a single dataframe, aligning timeframes where datasets have different frequencies (e.g., monthly vs. quarterly). We will then analyse and visualise the data.
import pandas as pd
import numpy as np
from functools import reduce
import my_data_tools as dt
import plotly.express as px
import plotly.graph_objects as go
import math
df1 = pd.read_csv("../cleaned_csvs/Unemployment.csv") # Data Indicator 1 - Unemployment
df2 = pd.read_csv("../cleaned_csvs/Employment by Sector.csv") # Data Indicator 2 - Employment by Sector
df3 = pd.read_csv("../cleaned_csvs/Residential Property Price Index.csv") # Data Indicator 3 - RPPI
df4 = pd.read_csv("../cleaned_csvs/RTB rents.csv") # Data Indicator 4 - RTB rents
df5 = pd.read_csv("../cleaned_csvs/Dublin Housing.csv") # Data Indicator 5 - Dublin Housing
df6 = pd.read_csv("../cleaned_csvs/PUBLIC TRANSPORT.csv") # Data Indicator 8 - PUBLIC TRANSPORT
df7 = pd.read_csv("../cleaned_csvs/Dublin Airport Passenger Arrivals.csv") # Data Indicator 9 - Dublin Airport Passenger Arrivals
df8 = pd.read_csv("../cleaned_csvs/Dublin Port Tonnage.csv") # Data Indicator 10 - Dublin Port Tonnage
df9 = pd.read_csv("../cleaned_csvs/Dublin current conditions.csv") # Data ESRI KBC Dublin current conditions
df10 = pd.read_csv("../cleaned_csvs/Dublin expectations.csv") # Data ESRI KBC Dublin expectations
df11 = pd.read_csv("../cleaned_csvs/Dublin sentiment overall.csv") # Data ESRI KBC Dublin sentiment overall
Data pre-processing¶
In the following cell, we work on dataframe # 3, converting its 'Month Year' column to 'Quarter'.
# Convert 'Month Year' to datetime for easier manipulation
df3['Date'] = pd.to_datetime(df3['Month Year'], format='%b %y')
# Extract year and month
df3['Year'] = df3['Date'].dt.strftime('%y') # Extract two-digit year
df3['Month'] = df3['Date'].dt.month
# The following variable will help us navigate through all years
years = df3['Year'].unique()
# list that represents months falling in each quarter i.e. 1 is January and 2 is February and so on
quarters = [[1,3],[4,6],[7,9],[10,12]]
# list of columns needed to create the new dataframe
columns = df3.columns.tolist()
columns[0] = 'Quarter'
columns = columns[:9]
# New dataframe to keep mean values for all the columns
new_rppi = pd.DataFrame(columns=columns)
# list to append records to be later saved in a dataframe
dict_list=[]
# only choose selective columns
selective_columns = columns[1:]
# This function constructs a dictionary (row_dict) that represents a single row of data to be added to a DataFrame.
# The dictionary keys are column names, and the values are the corresponding data for those columns.
def record_data_in_dataframe(y, selective_columns, average_value, quarter):
    # Start with the quarter label, then pair each column name with its quarterly mean
    row_dict = {'Quarter': f'{quarter} {y}'}
    row_dict.update(zip(selective_columns, average_value))
    return row_dict
for y in years:  # loop through each year
    for i, q in enumerate(quarters):  # loop through every quarter
        # q[0] is the starting month of the quarter, q[1] the ending month
        # filter down to this particular quarter and year
        filtered_df = df3[(df3['Month'] >= q[0]) & (df3['Month'] <= q[1]) & (df3['Year'] == y)]
        average_value = filtered_df[selective_columns].mean().round(2)
        # the quarter label follows directly from the position in the quarters list
        # (index 0 -> Q1, index 1 -> Q2, and so on)
        row_dict = record_data_in_dataframe(y, selective_columns, average_value, quarter=f'Q{i + 1}')
        dict_list.append(row_dict)
new_rppi = pd.DataFrame.from_dict(dict_list)
# revert back to original name
df3 = new_rppi
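As an aside, the month-to-quarter aggregation above can be expressed much more compactly with pandas' built-in period support. A minimal sketch with made-up values mirroring df3's layout:

```python
import pandas as pd

# Toy frame mirroring df3's 'Month Year' format ('Jan 06' etc.); values are illustrative
raw = pd.DataFrame({
    'Month Year': ['Jan 06', 'Feb 06', 'Mar 06', 'Apr 06'],
    'All Dublin properties': [100.0, 102.0, 104.0, 106.0],
})

# to_period('Q') maps each month to its quarter; groupby then averages within quarters
raw['Period'] = pd.to_datetime(raw['Month Year'], format='%b %y').dt.to_period('Q')
quarterly = raw.groupby('Period').mean(numeric_only=True).round(2).reset_index()

# Reproduce the 'Q1 06' label style used elsewhere in the notebook
quarterly['Quarter'] = quarterly['Period'].dt.strftime('Q%q %y')
quarterly = quarterly.drop(columns='Period')
```

This avoids the manual year/quarter loop entirely, at the cost of being harder to step through when debugging.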
Since there is one missing value, we will run data cleaning on the df3 dataframe.
# Performing data cleaning on dataframe 3
transformations = dict()
df3 = dt.data_clean(df3, transformations)
In the following cell, we rename column names that are identical across 3 datasets. If left unchanged, pandas would disambiguate them with _x/_y suffixes during the merge, leaving ambiguous columns in the final dataset.
# Changing column names in dataframe 9
transformations = {
'Year': {'rename': 'Quarter'},
'Dublin': {'rename': 'Dublin (current conditions)'},
'National excl. Dublin': {'rename': 'National (current conditions)'},
}
df9 = dt.data_clean(df9, transformations)
# Changing column names in dataframe 10
transformations = {
'Year': {'rename': 'Quarter'},
'Dublin': {'rename': 'Dublin (expectations)'},
'National excl.Dublin': {'rename': 'National (expectations)'},
}
df10 = dt.data_clean(df10, transformations)
# Changing column names in dataframe 11
transformations = {
'Year': {'rename': 'Quarter'},
'Dublin': {'rename': 'Dublin (sentiment overall)'},
'National excl. Dublin': {'rename': 'National (sentiment overall)'},
}
df11 = dt.data_clean(df11, transformations)
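To see why these renames are needed: pandas does not actually raise an error on duplicate non-key columns; it silently appends _x/_y suffixes, which would leave ambiguous column names in the merged dataset. A minimal illustration:

```python
import pandas as pd

# Two frames sharing a non-key column name, as df9/df10/df11 did before renaming
a = pd.DataFrame({'Quarter': ['Q1 08'], 'Dublin': [92.7]})
b = pd.DataFrame({'Quarter': ['Q1 08'], 'Dublin': [174.8]})

merged = pd.merge(a, b, on='Quarter', how='inner')
# pandas disambiguates the clashing 'Dublin' columns with default suffixes
print(merged.columns.tolist())  # ['Quarter', 'Dublin_x', 'Dublin_y']
```

Renaming up front keeps the 87 columns of the final dataset self-describing.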
Performing merge¶
# List of DataFrames
dfs = [df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11]
In this cell, I list the range of quarters present in each dataframe. Since the merge happens on the Quarter column of each dataframe, the dataframe covering the fewest years determines the size of the final dataset. Using this information, we could remove the dataframes with the fewest years to increase the overall size of the final dataset.
df1 --- Q3 99 -- Q3 24. (25 years)
df2 --- Q1 98 -- Q3 24. (26 years)
df3 --- Q1 06 -- Q3 24. (18 years)
df4 --- Q3 08 -- Q1 24. (16 years)
df5 --- Q1 12 -- Q3 24. (12 years)
df6 --- Q3 11 -- Q3 24. (13 years)
df7 --- Q4 14 -- Q3 24. (10 years)
df8 --- Q3 08 -- Q3 24. (16 years)
df9 --- Q1 08 -- Q4 16. (8 years)
df10 -- Q1 08 -- Q4 16. (8 years)
df11 -- Q1 08 -- Q4 16. (8 years)
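The ranges listed above can also be produced programmatically rather than read off by hand. A sketch, assuming each dataframe's 'Quarter' column is sorted chronologically:

```python
import pandas as pd

def quarter_range(df, name):
    # First and last quarter plus the number of quarterly records
    q = df['Quarter']
    return f"{name} --- {q.iloc[0]} -- {q.iloc[-1]} ({len(q)} quarters)"

# Toy frame standing in for one of the real dataframes
demo = pd.DataFrame({'Quarter': ['Q3 99', 'Q4 99', 'Q1 00']})
print(quarter_range(demo, 'df1'))  # df1 --- Q3 99 -- Q1 00 (3 quarters)
```

Looping this over `dfs` would reproduce the table above for the real data.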
final_df = reduce(lambda left, right: pd.merge(left, right, on='Quarter', how='inner'), dfs)
Our final dataset looks as follows. It has 9 rows and 87 columns. The sample is small because three dataframes (df9, df10, df11) cover only 8 years, and the inner join keeps only quarters present in every dataframe, so all other quarters are dropped from the result.
final_df.head()
| | Quarter | National Unemployment Rate SA (%) | Dublin Unemployed SA (000) | Dublin Unemployment Rate SA (%) | Dublin Employed SA (000) | QoQ Dublin Employed | QoQ Dublin Unemployment Rate | YoY Dublin Employed | %YoY Dublin Unemployed | Dublin - National Unemployment Rate SA (%) | ... | YoY% | QoQ% | Import | Export | Dublin (current conditions) | National (current conditions) | Dublin (expectations) | National (expectations) | Dublin (sentiment overall) | National (sentiment overall) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q4 14 | 10.8 | 62.5 | 9.3 | 605.5 | 14.8 | -1.1 | 27.957 | -1.4 | -1.5 | ... | 6.00 | 0.38 | 4660749 | 3131590 | 92.661892 | 99.356112 | 174.822892 | 173.332176 | 132.100009 | 137.208274 |
| 1 | Q1 15 | 10.7 | 61.1 | 9.2 | 603.8 | -1.6 | 0.0 | 19.621 | -1.7 | -1.5 | ... | 5.63 | 3.14 | 4803605 | 3225965 | 102.663236 | 114.107698 | 198.998799 | 188.849562 | 148.905288 | 152.351706 |
| 2 | Q2 15 | 10.0 | 52.1 | 7.9 | 609.4 | 5.6 | -1.3 | 20.197 | -2.1 | -2.1 | ... | 4.41 | 0.19 | 4710315 | 3334840 | 110.895130 | 111.359357 | 203.944343 | 194.441296 | 155.559700 | 153.870825 |
| 3 | Q3 15 | 9.6 | 53.8 | 8.0 | 618.9 | 9.5 | 0.2 | 28.172 | -2.3 | -1.6 | ... | 6.84 | 3.01 | 4928335 | 3358643 | 100.900800 | 117.274387 | 197.426303 | 194.719195 | 147.234025 | 156.901441 |
| 4 | Q4 15 | 9.5 | 55.5 | 8.1 | 627.7 | 8.9 | 0.0 | 22.266 | -1.2 | -1.4 | ... | 9.08 | 2.47 | 5118851 | 3373220 | 102.272674 | 118.651234 | 207.146525 | 200.642119 | 152.613192 | 160.604430 |
5 rows × 87 columns
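The inner join is what shrinks the result: only quarters present in every source survive, and no NaN-filled rows are ever created. A toy illustration of inner vs outer merges on partially overlapping quarters:

```python
import pandas as pd

# Toy frames with partially overlapping quarters
a = pd.DataFrame({'Quarter': ['Q1 15', 'Q2 15', 'Q3 15'], 'a': [1, 2, 3]})
b = pd.DataFrame({'Quarter': ['Q2 15', 'Q3 15', 'Q4 15'], 'b': [4, 5, 6]})

inner = pd.merge(a, b, on='Quarter', how='inner')  # only shared quarters survive
outer = pd.merge(a, b, on='Quarter', how='outer')  # all quarters, NaN where missing

print(len(inner), len(outer))  # 2 4
```

An outer merge over `dfs` would keep every quarter from 1998 onwards, at the cost of many missing values in the columns sourced from the shorter dataframes.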
There is one string column (Quarter), one boolean column, and 85 integer/float columns.
final_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9 entries, 0 to 8 Data columns (total 87 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Quarter 9 non-null object 1 National Unemployment Rate SA (%) 9 non-null float64 2 Dublin Unemployed SA (000) 9 non-null float64 3 Dublin Unemployment Rate SA (%) 9 non-null float64 4 Dublin Employed SA (000) 9 non-null float64 5 QoQ Dublin Employed 9 non-null float64 6 QoQ Dublin Unemployment Rate 9 non-null float64 7 YoY Dublin Employed 9 non-null float64 8 %YoY Dublin Unemployed 9 non-null float64 9 Dublin - National Unemployment Rate SA (%) 9 non-null float64 10 Construction 9 non-null float64 11 Wholesale and retail trade, repair of motor vehicles and motorcycles 9 non-null float64 12 Transportation and storage 9 non-null float64 13 Accommodation and food service activities 9 non-null float64 14 Information and communication 9 non-null float64 15 Professional, scientific and technical activities 9 non-null float64 16 Administrative and support service activities 9 non-null float64 17 Public administration and defence, compulsory social security 9 non-null float64 18 Education 9 non-null float64 19 Human health and social work activities 9 non-null float64 20 Industry 9 non-null float64 21 Industry and Construction 9 non-null float64 22 Services 9 non-null float64 23 Financial, insurance and real estate activities 9 non-null float64 24 Other NACE activities 9 non-null float64 25 Private Sector 9 non-null float64 26 Total 9 non-null float64 27 All Dublin properties 9 non-null float64 28 All National properties Excl. Dublin 9 non-null float64 29 % YoY Dublin 9 non-null float64 30 % YoY National excl. Dublin 9 non-null float64 31 MoM Dublin 9 non-null float64 32 MoM National excl. Dublin 9 non-null float64 33 %MoM Dublin 9 non-null float64 34 %MoM National excl. 
Dublin 9 non-null float64 35 Dublin 9 non-null int64 36 Greater Dublin Area 9 non-null int64 37 Outside GDA 9 non-null int64 38 %YoY Dublin 9 non-null float64 39 %YoY Greater Dublin 9 non-null float64 40 %YoY Outside GDA 9 non-null float64 41 YoY Dublin 9 non-null int64 42 QoQ Dublin 9 non-null int64 43 %QoQ Dublin 9 non-null float64 44 Completions NEW 9 non-null int64 45 Commencements 9 non-null int64 46 completions___commencements 9 non-null bool 47 unsafe__yoy_completions 9 non-null float64 48 unsafe__yoy_commencements 9 non-null float64 49 YoY Completions 9 non-null float64 50 YoY Commencements 9 non-null float64 51 unsafe__qoq_completions 9 non-null float64 52 unsafe__qoq_commencements 9 non-null float64 53 QoQ Completions 9 non-null float64 54 QoQ Commencements 9 non-null float64 55 Public Transport million trips2 9 non-null float64 56 Bus Éireann 9 non-null float64 57 Dublin City Bus 9 non-null float64 58 Irish Rail 9 non-null float64 59 Luas 9 non-null float64 60 QoQ Public Transport 9 non-null float64 61 %YoY Public Transport 9 non-null float64 62 %QoQ Public Transport 9 non-null float64 63 %QoQ Bus Eireann 9 non-null float64 64 %QoQ Dub City Bus 9 non-null float64 65 %QoQ Irish Rail 9 non-null float64 66 %QoQ Luas 9 non-null float64 67 Dublin Airport Total Pax Seasonally Adjusted by CSO 9 non-null int64 68 Total (m) 9 non-null float64 69 % YoY 9 non-null float64 70 %QoQ 9 non-null float64 71 YoY 9 non-null float64 72 % Average YoY 9 non-null float64 73 Changes in previous four quarters 9 non-null float64 74 Sum of previous four quarters 9 non-null float64 75 Dublin Port Tonnage Total 9 non-null int64 76 Dublin Port Tonnage Total/1000000 9 non-null float64 77 YoY% 9 non-null float64 78 QoQ% 9 non-null float64 79 Import 9 non-null int64 80 Export 9 non-null int64 81 Dublin (current conditions) 9 non-null float64 82 National (current conditions) 9 non-null float64 83 Dublin (expectations) 9 non-null float64 84 National (expectations) 9 non-null float64 85 
Dublin (sentiment overall) 9 non-null float64 86 National (sentiment overall) 9 non-null float64 dtypes: bool(1), float64(74), int64(11), object(1) memory usage: 6.2+ KB
Data Visualisation¶
# Rename columns for legend
df_renamed = final_df.rename(columns={'QoQ Dublin Employed': 'People employed', # Unemployment
'QoQ Dublin': 'Rents in Dublin', # RTB Rents
'QoQ Completions': 'Houses completed', # Dublin Housing
'QoQ Commencements': 'Houses commenced', # Dublin Housing
'QoQ Public Transport': 'Public Transport' # Public Transport
})
x = df_renamed['Quarter']
y = ['People employed', 'Rents in Dublin', 'Houses completed', 'Houses commenced', 'Public Transport']
chart_type = "bar" # 'line', 'bar', 'scatter'
title="Quarter-on-Quarter changes for different indicators"
y_axis_label = "QoQ change"
fig = dt.visualize_data(df_renamed, x, y, chart_type, title, y_axis_label)
fig.show()
The above visualisation indicates that housing was the most volatile indicator between 2014 and 2016. Within housing, commencements completely dwarf completions: far fewer houses are being delivered than are being started.
Employment & Unemployment¶
Visualisation # 1 - People employed vs unemployed in Dublin¶
x = df1['Quarter']
y = ['Dublin Unemployed SA (000)', 'Dublin Employed SA (000)']
chart_type = "bar" # 'line', 'bar', 'scatter'
title="People employed vs unemployed in Dublin"
y_axis_label = "In thousands"
fig = dt.visualize_data(df1, x, y, chart_type, title, y_axis_label)
fig.show()
Visualisation # 2 - Employment status comparison: 1999 vs 2024¶
from plotly.subplots import make_subplots
import plotly.graph_objects as go
# dt.get_df(dataframe, quarter and the specific year, list of chosen columns, decision, name of the category, name of the values)
df_1999 = dt.get_df(df1, 'Q3 99', ['Dublin Unemployed SA (000)', 'Dublin Employed SA (000)'], 'selective', 'Employment Status', 'Employees (000)')
df_2024 = dt.get_df(df1, 'Q3 24', ['Dublin Unemployed SA (000)', 'Dublin Employed SA (000)'], 'selective', 'Employment Status', 'Employees (000)')
# Create subplots with 1 row and 2 columns
fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'pie'}, {'type': 'pie'}]])
# Add the first pie chart to the first subplot
fig.add_trace(
go.Pie(labels=df_1999['Employment Status'], values=df_1999['Employees (000)'], name="Q3 1999"),
row=1, col=1
)
# Add the second pie chart to the second subplot
fig.add_trace(
go.Pie(labels=df_2024['Employment Status'], values=df_2024['Employees (000)'], name="Q3 2024"),
row=1, col=2
)
# Update layout for better appearance
fig.update_layout(
title_text="Side by Side comparison: 1999 vs 2024",
width=1000, # Adjust the width of the entire figure
height=500, # Adjust the height of the entire figure
showlegend=True # Enable/Disable the legend
)
# Show the figure
fig.show()
The above visualisation compares people employed vs unemployed in Dublin in 1999 and 2024. In 1999, 95.6% of the Dublin labour force was employed, whereas 25 years later (in 2024) 95.3% was employed. Conversely, 4.36% were unemployed in 1999 and 4.74% in 2024.
In short, there is no significant difference: the employed share fell by about 0.3 percentage points over 25 years, while the unemployed share grew by about 0.4 percentage points.
Visualisation # 3 - Employment sector comparison: 1998 vs 2024¶
# Local prototype of dt.get_df using a column range (kept for reference; dt.get_df is called below)
def get_df(df, quarter):
df = df.loc[df['Quarter']==quarter, "Construction":"Private Sector"]
# Reshaping the data
df_melted = df.melt(var_name='Employment Sector', value_name='Employees (000)')
return df_melted
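The melt step turns the single selected row from wide format into long format, which is the shape the pie charts consume. A minimal illustration with made-up sector values:

```python
import pandas as pd

# One row of wide sector data, as selected by the .loc slice above (values illustrative)
wide = pd.DataFrame({'Construction': [80.3], 'Services': [210.5]})

# melt: column names become the 'Employment Sector' category, values become 'Employees (000)'
long = wide.melt(var_name='Employment Sector', value_name='Employees (000)')
# long now has one row per sector: ('Construction', 80.3) and ('Services', 210.5)
```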
# dt.get_df(dataframe, quarter and the specific year, list of chosen columns, decision, name of the category, name of the values)
df_1998 = dt.get_df(df2, 'Q1 98', ['Construction','Private Sector'], 'range', 'Employment Sector', 'Employees (000)')
df_2024 = dt.get_df(df2, 'Q3 24', ['Construction','Private Sector'], 'range', 'Employment Sector', 'Employees (000)')
# Create subplots with 1 row and 2 columns
fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'pie'}, {'type': 'pie'}]])
# Add the first pie chart to the first subplot
fig.add_trace(
go.Pie(labels=df_1998['Employment Sector'], values=df_1998['Employees (000)'], name="Q1 1998"),
row=1, col=1
)
# Add the second pie chart to the second subplot
fig.add_trace(
go.Pie(labels=df_2024['Employment Sector'], values=df_2024['Employees (000)'], name="Q3 2024"),
row=1, col=2
)
# Update layout for better appearance
fig.update_layout(
title_text="Side by Side comparison: 1998 vs 2024",
width=1200, # Adjust the width of the entire figure
height=500, # Adjust the height of the entire figure
showlegend=True # Enable/Disable the legend
)
# Show the figure
fig.show()
In the above visualisation, it is observed that
- In 1998, 'Private Sector' and 'Services' were the two biggest employment sectors in Dublin, each accounting for 28.5% of the total.
- In 2024, the 'Services' sector grew further and became the largest employment sector.
- The third biggest employment sector in 1998 was 'Industry and Construction'. In 2024, it was replaced by 'Human health and social work activities'.
Real Estate (RPPI, RTB Rents, Housing)¶
Visualisation # 1 - Residential Property Price Index (RPPI) trends over time¶
x = df3['Quarter']
y = df3.columns
chart_type = "line" # 'line', 'bar', 'scatter'
title="Property Price Index: Dublin vs National"
y_axis_label = 'Property Price Index'
fig = dt.visualize_data(df3, x, y[1:3], chart_type, title, y_axis_label)
fig.show()
Visualisation # 2 - Distribution of rental prices (RTB rents) across different Dublin regions.¶
new_df4 = df4.loc[:,['Dublin', 'Greater Dublin Area', 'Outside GDA']]
melted_df4 = pd.melt(new_df4, var_name='Areas', value_name='Rents')
fig = px.box(melted_df4, x="Areas", y="Rents")
fig.show()
In the above visualisation, it is observed that
- Dublin is the most expensive area in terms of rent, followed by the Greater Dublin Area and Outside GDA.
Between 2008 and 2024, it is observed that
- The minimum rent recorded in Dublin is 1037 and maximum is 2065.
- The minimum rent recorded in Greater Dublin Area is 759 and maximum is 1516.
- The minimum rent recorded outside GDA is 539 and maximum is 1039.
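The minima and maxima quoted above can be read straight off the melted frame with a groupby aggregation. A sketch using illustrative values shaped like melted_df4:

```python
import pandas as pd

# Toy long-format rents, shaped like melted_df4 above (values illustrative)
melted = pd.DataFrame({
    'Areas': ['Dublin', 'Dublin', 'Outside GDA', 'Outside GDA'],
    'Rents': [1037, 2065, 539, 1039],
})

# One min/max pair per area, matching what the box plot whiskers show
summary = melted.groupby('Areas')['Rents'].agg(['min', 'max'])
print(summary)
```

Running the same aggregation on the real melted_df4 reproduces the figures listed above.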
Transport (Public, Airport, Port)¶
Visualisation # 1 - Use of public transport in Dublin¶
data = df6
# Convert to long format for Plotly
df_long = data.melt(id_vars=['Quarter', 'Public Transport million trips2'],
value_vars=['Bus Éireann', 'Dublin City Bus', 'Irish Rail', 'Luas'],
var_name='Transport_Type', value_name='Trips')
# Calculate bubble size
df_long['Size'] = df_long['Trips'].apply(lambda x: math.sqrt(x) * 10)
# Create Figure
fig = go.Figure()
# Define colors for transport types
colors = {
'Bus Éireann': 'red',
'Dublin City Bus': 'blue',
'Irish Rail': 'green',
'Luas': 'purple'
}
for transport in df_long['Transport_Type'].unique():
subset = df_long[df_long['Transport_Type'] == transport]
fig.add_trace(go.Scatter(
x=subset['Quarter'],
y=subset['Trips'],
name=transport,
text=subset['Transport_Type'],
marker=dict(size=subset['Size'], color=colors[transport], sizemode='area', line_width=2),
mode='markers'
))
# Update layout
fig.update_layout(
title="Public Transport Trips by Type",
xaxis=dict(title="Quarter"),
yaxis=dict(title="Trips (millions)"),
paper_bgcolor='rgb(243, 243, 243)',
plot_bgcolor='rgb(243, 243, 243)',
)
fig.show()
In the above visualisation, it is observed that
- Use of public transport is rising, but only modestly.
- In 2011, 44 million trips were made, compared with 69.1 million by 2024.
- Public transport usage plunged in 2020 due to COVID and did not return to normal levels until late 2022 / early 2023.
Visualisation # 2 - Trends of Public Transport usage, Airport Arrivals, and Port Tonnage¶
# List of DataFrames
dfs = [df6, df7, df8]
transport_df = reduce(lambda left, right: pd.merge(left, right, on='Quarter', how='inner'), dfs)
# Rename columns for legend
df_renamed = transport_df.rename(columns={'Public Transport million trips2': 'Public Transport', # PUBLIC TRANSPORT
'Total (m)': 'Airport Passenger Arrivals', # Dublin Airport Passenger Arrivals
'Dublin Port Tonnage Total/1000000': 'Port Tonnage', # Dublin Port Tonnage
})
x = df_renamed['Quarter']
y = ['Public Transport', 'Airport Passenger Arrivals', 'Port Tonnage']
chart_type = "line" # 'line', 'bar', 'scatter'
title="Trends of Public Transport, Airport Arrivals, and Port Tonnage"
y_axis_label = "In millions"
fig = dt.visualize_data(df_renamed, x, y, chart_type, title, y_axis_label)
fig.show()
Visualisation # 3 - Compare transportation modes in terms of growth percentage¶
# Rename columns for legend
df_renamed = df6.rename(columns={'%QoQ Bus Eireann': '% Bus Eireann',
'%QoQ Dub City Bus': '% Dublin City Bus',
'%QoQ Irish Rail': '% Irish Rail',
'%QoQ Luas': '% Luas'
})
x = df_renamed['Quarter']
y = ['% Bus Eireann', '% Dublin City Bus', '% Irish Rail', '% Luas']
chart_type = "line" # 'line', 'bar', 'scatter'
title="Transportation modes in terms of growth percentage"
y_axis_label = "% Quarter on Quarter change"
fig = dt.visualize_data(df_renamed, x, y, chart_type, title, y_axis_label)
fig.show()
Visualisation # 4 - Share of total passengers using different transport modes in the latest quarter¶
df = dt.get_df(df6, 'Q3 24', ['Bus Éireann','Luas'], 'range','Transportation modes','Trips')
fig = px.pie(df, values='Trips', names='Transportation modes', title='Trips using multiple transport modes in Q3 2024')
# Update layout for better appearance
fig.update_layout(
width=1000, # Adjust the width of the entire figure
height=600, # Adjust the height of the entire figure
showlegend=True # Enable/Disable the legend
)
fig.show()